Sentiment Analysis with IBM Debater Sentiment Composition Lexicons Dataset

In this notebook, you will explore how to infer sentiments from a document using the IBM Debater® Sentiment Composition Lexicons dataset. The dataset includes sentiment composition lexicons and sentiment lexicons.

This dataset can be obtained for free from IBM Developer Data Asset Exchange.


0. Prerequisites

Before you run this notebook, complete the following steps:

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

[Video walkthrough: ws-project.mov]

1. Sentiment Analysis

Using the IBM Debater Sentiment Composition Lexicons dataset, the goal is to take a piece of text and capture the sentiment of the entire text, as well as of each sentence in it. The model is based on the Learning Sentiment Composition from Sentiment Lexicons paper, written by the creators of the dataset. The final model is built in section 1.5, which combines the steps from sections 1.2 to 1.4.

In the following subsections, you will:

1.1 Get Data Files Paths

Start by extracting the paths of the dataset files that were stored in the project assets by the project notebook Part 1 - Data Exploration & Visualization, as well as the paths of the rule files 'ADJECTIVES.xlsx' and 'SEMANTIC_CLASSES.xlsx' from the original dataset.

Note: The code below assumes you have already run the Data Exploration & Visualization notebook prior to running this notebook.

1.2 Unigrams

In this first implementation (calculate_unigram_sentiment()), you will use the unigrams dataset, which contains the sentiment of various unigrams. Given an input sentence, the sentence is tokenized (split into a list of words), and then each word in the sentence is matched against the words in the unigram dataset to find its sentiment value. If a word is not found in the dataset, it is skipped.
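The lookup described above can be sketched as follows. This is a minimal illustration, not the notebook's calculate_unigram_sentiment(): the toy lexicon, the regular-expression tokenizer, the function names, and the choice to average the matched scores are all assumptions made for the example.

```python
import re

# Hypothetical miniature lexicon; the real unigram lexicon comes from the dataset.
TOY_UNIGRAMS = {"good": 1.0, "bad": -0.5}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def unigram_sentiment_sketch(sentence, lexicon):
    """Average the scores of the tokens found in the lexicon.

    Tokens that are missing from the lexicon are skipped. Averaging is an
    assumption; the original notebook may aggregate differently.
    Returns None when no token matches.
    """
    scores = [lexicon[tok] for tok in tokenize(sentence) if tok in lexicon]
    return sum(scores) / len(scores) if scores else None
```

Returning None rather than 0 when nothing matches keeps "no evidence" distinct from "neutral", which is useful when several scoring methods are chained later.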

Below is an example using this simple implementation.

1.3 Bigrams

This second implementation (calculate_bigram_sentiment()) is similar to the unigram implementation, but it uses bigrams instead. Given an input sentence, the sentence is tokenized into bigrams (for example, "the good dog" becomes something like [("the", "good"), ("good", "dog")]), and then each bigram in the sentence is matched against the bigrams in the dataset to find its sentiment value. If a bigram is not found in the dataset, it is skipped.
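A minimal sketch of the bigram step, under stated assumptions: the lexicon is modeled as a dictionary keyed by (word, word) tuples, matched scores are averaged, and the names to_bigrams and bigram_sentiment_sketch are hypothetical. The notebook's calculate_bigram_sentiment() may differ in all of these details.

```python
def to_bigrams(tokens):
    """Pair each token with its successor: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def bigram_sentiment_sketch(tokens, lexicon):
    """Average the scores of the bigrams found in the lexicon.

    `lexicon` is assumed to map (word, word) tuples to scores. Bigrams that
    are not in the lexicon are skipped. Returns None when nothing matches.
    """
    scores = [lexicon[bg] for bg in to_bigrams(tokens) if bg in lexicon]
    return sum(scores) / len(scores) if scores else None
```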

Below is an example using the bigram implementation.

1.4 Using Composition and Adjective Classes

Composition and Adjective Class

Using the rules from Table 1 of the Learning Sentiment Composition from Sentiment Lexicons paper, the sentiment analysis model is created to produce sentiment scores. Additionally, bigrams are matched to certain rules that produce a predicted polarity (positive or negative). There are two groups of rules: composition classes and adjective classes. Adjective classes focus on the adjective pairs (high, low) and (fast, slow).

These rules are driven by two files: 1) ADJECTIVES.xlsx and 2) SEMANTIC_CLASSES.xlsx. The adjectives file contains 5 sheets. The first sheet gives a list of words similar to each of high, low, fast, and slow. The next four sheets list the words associated with each of those four adjectives.

The semantic classes file has 6 sheets, one for each of the composition classes defined in the paper. In each sheet, there is a list of words that corresponds to that composition class.

ADJECTIVES.xlsx

This file contains the lists of semantic class words for the gradable adjective pairs.

SEMANTIC_CLASSES.xlsx

This file contains the lists of semantic class words for each type. For each semantic class (reversers, propagators, and dominators), there are two tabs in the Excel file: one for positive composition (POS) and one for negative composition (NEG). There are 6 tabs in total: DOMINATOR_NEG, DOMINATOR_POS, PROPAGETOR_POS, PROPAGETOR_NEG, REVERSER_POS, REVERSER_NEG.

1.4.1 Reading Adjective Classes Data Files

First, read in ADJECTIVES.xlsx and clean it up.

Now we have a dictionary that matches words to the adjective class (fast, high, low, slow). The dictionary looks like: {word: adjective_class}
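One way to arrive at a {word: adjective_class} dictionary is to invert a mapping from each class to its word list. The sketch below assumes the sheets have already been loaded into plain Python lists; the helper name invert_class_lists and the sample words are hypothetical and are not taken from the dataset.

```python
def invert_class_lists(words_by_class):
    """Turn {adjective_class: [words]} into {word: adjective_class}.

    Words are stripped and lowercased so later lookups are case-insensitive.
    If a word appears under several classes, the last class seen wins.
    """
    word_to_class = {}
    for adj_class, words in words_by_class.items():
        for word in words:
            word_to_class[word.strip().lower()] = adj_class
    return word_to_class


# Hypothetical sample input standing in for the cleaned sheet contents.
sample = {"high": ["high", " Elevated "], "low": ["low", "reduced"]}
adjective_lookup = invert_class_lists(sample)
```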

1.4.2 Reading Composition Classes Data Files

Next, read the SEMANTIC_CLASSES.xlsx file.

1.4.3 Matching Adjective/Composition Classes

Now write a method to match bigrams to an adjective or composition class. As stated in the paper, the bigram matching order will be: ADJ (adjective), REV (reverser), PROP (propagator), DOM (dominator).
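The priority order can be sketched as a loop over the class labels that stops at the first hit. This is only an illustration of the ordering: it assumes the class word is the first word of the bigram, it returns the class label rather than a polarity, and the function name and the REV and DOM sample words are hypothetical placeholders, not entries from SEMANTIC_CLASSES.xlsx.

```python
# Matching priority stated in the text: adjective, reverser, propagator, dominator.
MATCH_ORDER = ("ADJ", "REV", "PROP", "DOM")


def match_class_sketch(bigram, class_words):
    """Return the first class label whose word set contains the bigram's first word.

    `class_words` maps a label from MATCH_ORDER to a set of words.
    Returns None when no class matches.
    """
    first_word = bigram[0]
    for label in MATCH_ORDER:
        if first_word in class_words.get(label, set()):
            return label
    return None


# Hypothetical word sets; "high" is deliberately listed twice to show the priority.
toy_classes = {
    "ADJ": {"high", "low", "fast", "slow"},
    "REV": {"reduce"},
    "DOM": {"high", "terrible"},
}
```

Because the loop returns on the first match, a word that belongs to more than one class is always resolved in favor of the class that comes earlier in the order.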

The sentiment of sentences can now be calculated. For example:

A value < 0 means a negative sentiment and a value > 0 means positive. This means that in the example above, the sentence has a negative sentiment. The tuple printed out is the bigram that was matched to determine the sentiment.

1.5 Combining Unigram, Bigram, and Composition/Adjective Classes

In this section, the techniques from 1.2 to 1.4 are combined to calculate sentiment. The first step is to get the bigrams of the text. Then, to determine the final sentiment of each bigram:

  1. Take the bigram score (1.3, calculate_bigram_sentiment()). If this does not exist, then
  2. Take the score from matching composition/adjective classes (1.4, calculate_compostion_or_adj_sentiment()). If this does not exist, then
  3. Look at the unigrams (1.2, calculate_unigram_sentiment()) of the bigram. Both words need to be negative for the bigram to be negative (and likewise both positive for it to be positive). If one is positive and one is negative, the bigram is neutral.
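The three-step fallback above can be sketched as follows. The function names, the tuple-keyed bigram lexicon, the callable used for the class rules, and the decision to treat a word missing from the unigram lexicon as neutral are assumptions made for this example; the notebook's own combined function may handle these cases differently.

```python
def unigram_pair_polarity(bigram, unigram_lexicon):
    """Step 3: return 1 if both words are positive, -1 if both are negative, else 0.

    Words missing from the lexicon are treated as neutral (an assumption).
    """
    first, second = (unigram_lexicon.get(word, 0.0) for word in bigram)
    if first > 0 and second > 0:
        return 1
    if first < 0 and second < 0:
        return -1
    return 0


def combined_bigram_score(bigram, bigram_lexicon, class_scorer, unigram_lexicon):
    """Try the bigram lexicon, then the class rules, then the unigram pair."""
    if bigram in bigram_lexicon:           # step 1: direct bigram score
        return bigram_lexicon[bigram]
    class_score = class_scorer(bigram)     # step 2: composition/adjective rules
    if class_score is not None:
        return class_score
    return unigram_pair_polarity(bigram, unigram_lexicon)  # step 3
```

Passing the class rules in as a callable that returns None on no match keeps the fallback chain independent of how those rules are implemented.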

Here is an example using the combined model. As stated previously, a score < 0 is negative sentiment, score > 0 is positive, and score = 0 is neutral.

1.6 Group by Overall Sentiment

Now, add a method that computes the overall sentiment of an entire comment rather than just that of its individual sentences. Within each comment, the individual sentences are also labeled as positive or negative.
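A minimal sketch of this grouping step, assuming a sentence-level scoring function is passed in. The regular-expression sentence splitter, the choice to sum sentence scores to get the overall score, and the names label_score and summarize_comment are assumptions for illustration only.

```python
import re


def label_score(score):
    """Map a numeric score to a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize_comment(comment, score_sentence):
    """Label each sentence of a comment and the comment as a whole.

    Sentences are split after '.', '!' or '?'. The overall score is the sum
    of the sentence scores (an assumption, not necessarily the notebook's rule).
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", comment) if s.strip()]
    scored = [(sentence, score_sentence(sentence)) for sentence in sentences]
    overall = sum(score for _, score in scored)
    return {
        "overall": label_score(overall),
        "sentences": [(sentence, label_score(score)) for sentence, score in scored],
    }
```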

Example

In the following example, the sentiment of each comment, and of each sentence within it, is displayed in a more human-readable way.

In this example, two comments are negative overall, two are positive overall, and one is neutral.

Summary

In this notebook you learned:

You can extend the concepts learned here and create a sample application such as the Customer Online Comments Organizer.

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.


Copyright © 2021 IBM. This notebook and its source code are released under the terms of the MIT License.